MCIC Wooster, OSU
2024-01-25
High-throughput sequencing (HTS)
Sequences 10^5 to 10^9 (usually randomly selected) DNA fragments (reads) at a time. Two types:
This lecture mostly talks about DNA sequencing, but:
This includes the indirect sequencing of RNA after reverse transcription to cDNA, as in nearly all “RNA-seq”.
Direct RNA sequencing is possible with some of the sequencing technologies we discuss, as I’ll briefly mention later, but is hard and not (yet) widely used.
Similarly, the shorthand “sequencing”, like in “high-throughput sequencing” in the title of this presentation, generally refers to DNA sequencing.
Modified after Pereira et al. 2020 (www.ncbi.nlm.nih.gov/pmc/articles/PMC7019349/)
Sequences a single, typically PCR-amplified, short-ish (≤900 bp) DNA fragment at a time.
Sequencing is performed by synthesizing a new DNA strand in part with fluorescently-labeled nucleotides, with a different color for each base (A, C, G, T).
The final result is a chromatogram that can be base-called:
A $100 million genome: the entire human genome (3 Gbp) was sequenced with Sanger!
https://www.genome.gov/about-genomics/fact-sheets/Sequencing-Human-Genome-cost
Common current applications of Sanger sequencing include:
Examining variation among individuals or populations in one or more candidate or marker genes (for population genetics, phylogenetics, functional inferences, etc.)
Taxonomic identification of a sample
Amplification of a target DNA fragment is usually done with PCR
This means that you need to (approximately) know in advance short flanking sequences to the sequence of interest — primers for your PCR.
Introns are good targets to sequence: variable sequences flanked by conserved sequences (exons) in which primers can be designed.
Variant analysis (for population genetics/genomics, molecular evolution, GWAS, etc.):
Whole-genome “resequencing”
Reduced-representation libraries (e.g. RADseq, GBS)
RNA-seq (transcriptome analysis)
Other functional sequencing methods like Methyl-seq, ChIP-seq, etc.
Microbial community characterization
Metabarcoding
Shotgun metagenomics
Short-read (Illumina) HTS: 50-300 bp reads
Long-read HTS: longer & more variable read lengths (PacBio: 10-50 kbp, ONT: 10-100 kbp)
When are longer reads useful?
Genome assembly
Haplotype and large structural variant calling
Transcript isoform identification
Taxonomic identification of single reads (microbial metabarcoding)
When does read length not matter (as much)?
Read-as-a-tag: the goal is just to know a read’s origin in a reference genome, like in counting applications such as RNA-seq
SNP variant analysis
Currently, no sequencing technology is error-free, and several types of errors can occur:
Base call errors, e.g. a base that was called as an A may instead be a G.
Insertion or deletion (indel) errors
When the base calling software is not confident at all, it can also return Ns (= undetermined).
Quality scores in sequence data
When you get sequences from a high-throughput sequencer, base calls have typically already been made. Every base is also accompanied by a quality score (inversely related to the estimated error probability).
But made more challenging by natural genetic variation among and within (heterozygosity due to diploid genomes) individuals
Typical depths of coverage: at least 50-100x for genome assembly; 10-30x for resequencing.
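As a rough illustration of what these depths mean in practice, average depth of coverage can be estimated as the total number of sequenced bases divided by the genome size. The sketch below assumes reads of uniform length and ignores unmapped or duplicate reads; the function names are hypothetical and only serve this example.

```python
import math


def depth_of_coverage(n_reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Average depth = total sequenced bases / genome size (simplified)."""
    return n_reads * read_length_bp / genome_size_bp


def reads_needed(target_depth: float, read_length_bp: int, genome_size_bp: int) -> int:
    """Number of reads needed to reach a target average depth (rounded up)."""
    return math.ceil(target_depth * genome_size_bp / read_length_bp)


# Example: 30x resequencing of a 3 Gbp genome with 150 bp reads
n = reads_needed(30, 150, 3_000_000_000)
print(n)                                          # 600000000
print(depth_of_coverage(n, 150, 3_000_000_000))   # 30.0
```

With paired-end data, each read pair contributes two reads, so the number of pairs needed is half the number of reads.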
100-300 bp reads with 0.1-0.2% error rates
More reads, lower per-base cost, and lower error rates than long-read sequencing.
In a sequencing context, a “library” is a collection of nucleic acid fragments ready for sequencing.
In Illumina and other HTS libraries, the fragments number in the millions or billions and are often randomly generated from input such as genomic DNA — cf. Sanger sequencing:
Different library prep procedures are used depending on the type of sequencing (WGS, RAD-seq, RNA-seq, etc.) and HTS technology — and some include more specific fragment generation or selection. We’ll see the specific library prep steps for RNA-seq next week.
After library prep (here, for Illumina sequencing), each DNA fragment is flanked by several types of short sequences that together make up the “adapters”:
After talking about paired-end vs. single-end sequencing and the way Illumina sequencing works, we’ll take a closer look at the adapter components.
In Illumina sequencing, DNA fragments can be sequenced from both ends as shown below; this is called “paired-end” (PE) sequencing:
When sequencing is instead single-end (SE), it simply means that there is no reverse read:
Paired-end sequencing is a way to effectively increase the read length. (In the resulting sequence files, the two reads in each pair are separate, but can be matched thanks to shared read IDs.)
Earlier, we saw that the maximum read length of Illumina is 300 bp but in paired-end sequencing, this becomes “2 x 300 bp”, etc.
The insert size varies — because the library prep protocol can aim for various sizes, and because of variation due to limited precision in size selection. It can be:
First, library fragments bind to a surface thanks to the adapters, and the DNA templates (the biological sequences) are then PCR-amplified to form “clusters” of identical fragments:
In the diagram above, for illustrative purposes:
Only a few nucleotides are shown (1 block = 1 nucleotide) — in reality, fragments are much longer
Only two templates => clusters are shown — in reality, there are millions
Then, sequencing is performed by synthesizing a new strand using fluorescently-labeled bases and taking a picture each time a new nucleotide is incorporated:
The different templates within a cluster get out of sync because occasionally:
They miss a base incorporation
They incorporate two bases at once
Base incorporation may also terminate before the end of the template is reached
This error profile is why, for Illumina:
There are hard limits on read lengths
Base quality scores decrease along the read
Now that you have a better idea of how Illumina sequencing works, let’s take a closer look at the different components of the adapters flanking the DNA:
The technologies underlying the two main long-read HTS technologies are very different, but have some commonalities beyond long reads — they:
Error rates are changing
As a shorthand that was universally true until recently, I mentioned earlier that long-read HTS has a higher error rate than short-read (Illumina) HTS.
However, error rates in one type of PacBio sequencing where individual fragments are sequenced multiple times (“HiFi”) are now lower than in Illumina.
A single strand of DNA passes through a nanopore —
what is measured is the electrical current, which depends on which combination of bases passes through:
https://www.genome.gov/genetics-glossary/Nanopore-DNA-Sequencing
Under development!
ONT constantly releases new flow cells with updated technology, which have led to large decreases in error rates over the past decade and even over the past two or so years.
At the same time, there is also a lot of development in the base-calling software. Unlike with Illumina or PacBio, it is common & useful to receive raw, not-basecalled data files: re-basecalling the same data a few years later with new software (versions) can make a substantial difference.
Advantages of ONT:
Low capital cost, portability (in-the-field sequencing!)
Read length not inherently limited, some extremely long reads
Lower cost per base
Disadvantages of ONT:
Higher error rates
Some systematic errors (e.g. homopolymers)
As methods facilitating genomics and transcriptomics research, genomes loom large in HTS. Specifically, most HTS applications either require a “reference genome” or involve its production.
What exactly does “reference genome” refer to? Three components to this phrase:
Taxonomic identity
This is typically considered at the species level, in which case it should involve the focal species. But:
If necessary, it is often possible to work with reference genomes of closely related species
Conversely, different reference genomes may exist for different subspecies/populations within a species
https://en.wikipedia.org/wiki/Genome_size
Key features:
Number of distinct chromosomes
Ploidy
Creating a reference genome has two main steps: assembly and annotation.
A few notes on genome assembly:
Most assemblies are not “chromosome-level”, i.e. they don’t have one contiguous sequence for each chromosome. Instead, they consist of contigs and scaffolds, which can number in the thousands.
Even chromosome-level assemblies are not 100% complete
With increasing usage & quality of long-read HTS, we are generating better assemblies
To create chromosome-level assemblies, other technologies are typically also needed (Hi-C, optical mapping)
A few notes on genome annotation:
The first step is structural annotation: the identification of genes and other genomic features within the genome sequence
The second step is functional annotation: giving names and assigning functions to (mostly) genes
Whereas genome assembly typically does not borrow information from other organisms, genome annotation very heavily relies on that, based on the concept of sequence homology.
How is this data stored?
Both genome assemblies and annotations are typically saved in a single text file each — more on that soon.
All common genetic/genomic data files are plain-text, meaning that they can be opened by any text editor. However, they are often compressed to save space. The main types are:
FASTQ
The standard format for HTS reads — contains a quality score for each nucleotide.
SAM/BAM
An alignment format for HTS reads
FASTA files contain one or more (sometimes called multi-FASTA) DNA or amino acid sequences, with no limits on the number of sequences or the sequence lengths.
As mentioned, they are versatile, and are the standard format for:
Genome assembly sequences
Transcriptomes and proteomes (all of an organism’s transcripts & amino acid sequences, resp.)
Sequence downloads from NCBI such as a single gene/protein or other GenBank entry
Sequence alignments (but not from HTS reads)
The following example FASTA file contains two entries:
>unique_sequence_ID Optional description
ATTCATTAAAGCAGTTTATTGGCTTAATGTACATCAGTGAAATCATAAATGCTAAAAA
>unique_sequence_ID2
ATTCATTAAAGCAGTTTATTGGCTTAATGTACATCAGTGAAATCATAAATGCTAAATG
Each entry contains a header and the sequence itself, and:
Header lines start with > and are otherwise “free form” but usually provide an identifier (and sometimes metadata) for the sequence
FASTA file name extensions are variable:
Generic extensions are .fasta and .fa
Also used are extensions that explicitly indicate whether sequences are nucleotide (.fna) or amino acids (.faa)
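To make the format concrete, here is a minimal sketch of a FASTA reader in plain Python. It assumes that the sequence ID is the first whitespace-delimited word of the header and that a sequence may be wrapped across several lines; the function name `read_fasta` is hypothetical. Dedicated libraries handle more edge cases and are preferable for real work.

```python
import io
from typing import Dict, TextIO


def read_fasta(handle: TextIO) -> Dict[str, str]:
    """Return a dict mapping each sequence ID to its (unwrapped) sequence."""
    chunks: Dict[str, list] = {}
    seq_id = None
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header: the ID is the first word, the rest is free-form description
            words = line[1:].split()
            seq_id = words[0] if words else ""
            chunks[seq_id] = []
        elif seq_id is not None:
            chunks[seq_id].append(line)
    return {name: "".join(parts) for name, parts in chunks.items()}


example = io.StringIO(
    ">unique_sequence_ID Optional description\n"
    "ATTCATTAAAGCAGTTTATTGGCTTAATGTACATCAGTGAAATCATAAATGCTAAAAA\n"
    ">unique_sequence_ID2\n"
    "ATTCATTAAAGCAGTTTATTGGCTTAATGTACATCAGTGAAATCATAAATGCTAAATG\n"
)
records = read_fasta(example)
for name, seq in records.items():
    print(name, len(seq))
```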
FASTQ is the standard format for HTS reads.
Each read forms one FASTQ entry and is represented by four lines, which contain, respectively:
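A header line that starts with @ and e.g. uniquely identifies the read
The sequence itself
A + (plus sign)
One-character quality scores for each base in the sequence
The quality scores we saw in the read on the previous slide represent an estimate of the error probability of the base call.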
Specifically, they correspond to a numeric “Phred” quality score (Q), which is a function of the estimated probability that a base call is erroneous (P):
Q = -10 * log10(P)
For some specific probabilities and their rough qualitative interpretation for Illumina data:
| Phred quality score | Error probability | Rough interpretation |
|---|---|---|
| 10 | 1 in 10 | terrible |
| 20 | 1 in 100 | bad |
| 30 | 1 in 1,000 | good |
| 40 | 1 in 10,000 | excellent |
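The relationship between Q and P in the formula above can be checked with a few lines of Python. This is only a sketch of the arithmetic; the function names are hypothetical.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Invert Q = -10 * log10(P) to get P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred quality score for an estimated error probability P."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Each 10-point increase in Q corresponds to a tenfold decrease in the estimated error probability.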
This numeric quality score is represented in FASTQ files not by the number itself, but by a corresponding “ASCII character”.
This allows for a single-character representation of each possible score — as a consequence, each quality score character can conveniently correspond to (& line up with) a base character in the read.
| Phred quality score | Error probability | ASCII character |
|---|---|---|
| 10 | 1 in 10 | + |
| 20 | 1 in 100 | 5 |
| 30 | 1 in 1,000 | ? |
| 40 | 1 in 10,000 | I |
A rule of thumb
In practice, you almost never have to manually check the quality scores of bases in FASTQ files, but if you do, a rule of thumb is that letter characters are good (Phred of 32 and up).
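The character encoding in the table above can be reproduced by adding a fixed offset to the score and looking up the corresponding ASCII character. The sketch below assumes the "Phred+33" offset, which is standard for current Illumina FASTQ files but is not stated explicitly in this lecture; the function names are hypothetical.

```python
PHRED_OFFSET = 33  # assumption: standard "Phred+33" encoding


def decode_qualities(quality_string: str) -> list:
    """Convert a FASTQ quality string to a list of numeric Phred scores."""
    return [ord(char) - PHRED_OFFSET for char in quality_string]


def encode_qualities(scores: list) -> str:
    """Convert numeric Phred scores to a FASTQ quality string."""
    return "".join(chr(score + PHRED_OFFSET) for score in scores)


print(decode_qualities("+5?I"))            # [10, 20, 30, 40]
print(encode_qualities([10, 20, 30, 40]))  # +5?I
print(decode_qualities("A"))               # [32]
```

The last line shows where the rule of thumb comes from: the first uppercase letter, A, corresponds to a Phred score of 32.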
FASTQ files have no size limit, so you may receive a single file per sample, although:
With paired-end (PE) sequencing, forward and reverse reads are split into two files:
forward reads contain R1 and reverse reads contain R2 in the file name.
If sequencing was done on multiple lanes, you get one (SE) or two (PE) files per lane per sample.
FASTQ files have the extension .fastq or .fq (but are commonly compressed, leading to .fastq.gz etc.). All in all, having paired-end FASTQ files for 2 samples could look like this:
The GTF and GFF formats are tab-delimited tabular files that contain genome annotations, with:
One row for each annotated “genomic feature” (gene, exon, etc.)
One column for each piece of information about a feature, like its genomic coordinates
See the sample below, with an added header line (not normally present) with column names:
seqname source feature start end score strand frame attributes
NC_000001 RefSeq gene 11874 14409 . + . gene_id "DDX11L1"; transcript_id ""; db_xref "GeneID:100287102"; db_xref "HGNC:HGNC:37102"; description "DEAD/H-box helicase 11 like 1 (pseudogene)"; gbkey "Gene"; gene "DDX11L1"; gene_biotype "transcribed_pseudogene"; pseudo "true";
NC_000001 RefSeq exon 11874 12227 . + . gene_id "DDX11L1"; transcript_id "NR_046018.2"; db_xref "GeneID:100287102"; gene "DDX11L1"; product "DEAD/H-box helicase 11 like 1 (pseudogene)"; pseudo "true";
Some details on the more important/interesting columns:
strand: + (forward) or - (reverse) strand
Using specialized bioinformatics tools, you can align HTS reads (in FASTQ files) to a reference genome assembly (in a FASTA file).
The resulting alignments are stored in the SAM (uncompressed) / BAM (compressed) format.
SAM/BAM are tabular files with one line per alignment, each of which includes:
The position in the genome that the read aligned to
A mapping score based on the length of the alignment and the number of mismatches
The sequence of the aligned read itself
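To show where these pieces of information sit in a SAM line, here is a minimal parsing sketch. It relies on the SAM format's 11 mandatory tab-separated fields (read name, flag, reference name, position, mapping quality, CIGAR, mate reference, mate position, template length, sequence, qualities). The example line and the names `SamRecord` and `parse_sam_line` are made up for illustration; in practice, dedicated tools are used to read SAM/BAM files.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    read_id: str
    flag: int
    ref_name: str   # chromosome/contig the read aligned to
    position: int   # 1-based leftmost alignment position
    mapq: int       # mapping quality
    cigar: str      # compact description of the alignment
    sequence: str


def parse_sam_line(line: str) -> SamRecord:
    """Parse the mandatory fields of a single SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("A SAM alignment line needs at least 11 fields")
    return SamRecord(
        read_id=fields[0],
        flag=int(fields[1]),
        ref_name=fields[2],
        position=int(fields[3]),
        mapq=int(fields[4]),
        cigar=fields[5],
        sequence=fields[9],
    )


# Hypothetical alignment line, for illustration only
line = "read_001\t0\tNC_000001\t11874\t60\t10M\t*\t0\t0\tATTCATTAAA\tIIIIIIIIII"
record = parse_sam_line(line)
print(record.ref_name, record.position, record.mapq)
```

Header lines in SAM files start with @ and would need to be skipped before calling a function like this one.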
File conversions
FASTQ files can be converted to FASTA files (losing quality information) but not vice versa
SAM/BAM files can be converted to FASTQ files (losing alignment information) but not vice versa
Proteome FASTA files can be produced from the combination of a FASTA genome assembly and a GFF/GTF genome annotation
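As an illustration of the first conversion, the sketch below turns FASTQ records into FASTA records by keeping the header and sequence and discarding the quality line. It assumes the common four-lines-per-read FASTQ layout with unwrapped sequences; the function name `fastq_to_fasta` is hypothetical.

```python
import io
from typing import TextIO


def fastq_to_fasta(fastq: TextIO, fasta: TextIO) -> int:
    """Write FASTQ records as FASTA; return the number of reads converted."""
    n_records = 0
    while True:
        header = fastq.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        # Read in blocks of four lines: quality lines can themselves start
        # with '@', so headers cannot be found by their first character alone
        sequence = fastq.readline()
        plus_line = fastq.readline()
        fastq.readline()  # quality line: read and discarded
        if not header.startswith("@") or not plus_line.startswith("+"):
            raise ValueError("Input does not look like four-line FASTQ")
        fasta.write(">" + header[1:].strip() + "\n")
        fasta.write(sequence.strip() + "\n")
        n_records += 1
    return n_records


fastq_text = "@read_1\nACGT\n+\nIIII\n@read_2\nTTGA\n+\n?5+I\n"
out = io.StringIO()
print(fastq_to_fasta(io.StringIO(fastq_text), out))  # 2
print(out.getvalue(), end="")
```

The reverse conversion is not possible in a meaningful way because the FASTA file no longer contains the per-base quality scores.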